Figures and tables are saved in the ./figures/ and
./tables/ directories, respectively.
# Define folders structure and create folders if needed
figures_dir <- "./figures/"
tables_dir <- "./tables/"
dir.create(figures_dir,showWarnings = F)
dir.create(tables_dir,showWarnings = F)
We set some parameters for rendering.
knitr::opts_chunk$set(
# Display PNG but also keep PDF (reverse the order for PDF outputs)
dev = c("png", "pdf"),
# In general, keep all the figures produced in a chunk
fig.keep = "all",
# Save all figures
fig.path = figures_dir,
dpi = 72,
fig.width = 14, fig.height = 10,
# Cache chunk results and use cache if the chunk hasn't
# been changed since last rendering
cache = FALSE)
options(repr.plot.width = 14, repr.plot.height = 10)
We need ggplot2 and patchwork for the plot as well as biomaRt for gene ids conversion.
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(ggplot2)
library(patchwork)
library(biomaRt)
We set two colorblind-friendly palettes for the plots.
small_color_pal <- ggthemes::colorblind_pal()(8)
large_color_pal <- c("#F7F398", "#AA9A59", "#E63863", "#ff6db6", "#91D0BE", "#F3B1A0", "#57C3F3", "#AB3282", "#BD956A",
"#8C549C", "#6778AE", "#F1BB72", "#53A85F", "#9FA3A8", "#712820", "#58A4C3", "#E0D4CA", "#E4C755",
"#8C9A69", "#F28D35", "#44A08D", "#E6A8D7", "#D9B0C4")
We load the Seurat object containing processed scRNA-seq data.
sobj <- readRDS("seurat_subset_test.rds")
This Seurat object contains 5,000 cells within 4 assays
standard RNA expression assay
Spliced, spliced, unspliced and ambiguous transcript expression assays that have certainly been obtained using the velocyto tool. They can be used to investigate cell dynamics with a RNA velocity analysis.
sobj
## An object of class Seurat
## 131299 features across 5000 samples within 4 assays
## Active assay: RNA (33355 features, 2000 variable features)
## 3 layers present: counts, data, scale.data
## 3 other assays present: spliced, unspliced, ambiguous
## 2 dimensional reductions calculated: pca, umap
The RNA slot has been processed with a standard Seurat workflow to obtain a pca latent space from 2000 Highly variable genes which was used to cluster the cells and produce a UMAP visualization.
names(sobj@commands)
## [1] "NormalizeData.RNA" "FindVariableFeatures.RNA"
## [3] "ScaleData.RNA" "RunPCA.RNA"
## [5] "FindNeighbors.RNA.pca" "FindClusters"
## [7] "RunUMAP.RNA.pca"
We can assume that the cell type annotation present in the object is derived from this analysis. This cell type annotation presents 2 clusters of interneurons, the cell type of interest.
col_cell_type <- large_color_pal
names(col_cell_type) <- levels(sobj$Manual_type)
df_cluster_type <- unique(sobj@meta.data[,c("Manual_type","seurat_clusters")])
col_clusters <- large_color_pal
names(col_clusters) <- df_cluster_type[order(df_cluster_type$Manual_type),"seurat_clusters"]
plot_manual_type <- DimPlot(sobj,group.by = c("Manual_type"),cols = col_cell_type)
plot_clusters <- DimPlot(sobj,group.by = c("seurat_clusters"),cols = col_clusters)
wrap_plots(plot_manual_type,plot_clusters,ncol = 1)
The metadata contains other cell annotations.
sobj@meta.data[] <- lapply(sobj@meta.data, function(x) if(is.character(x)) as.factor(x) else x)
summary(sobj@meta.data)
## orig.ident nCount_RNA nFeature_RNA sample_sampleId_short
## hft:5000 Min. : 642 Min. : 501 hft_w16_p7_r2: 861
## 1st Qu.: 3018 1st Qu.:1528 hft_w16_p7_r1: 766
## Median : 4530 Median :2040 hft_w24_p6_r2: 689
## Mean : 5768 Mean :2249 hft_w21_p5_r2: 593
## 3rd Qu.: 6795 3rd Qu.:2732 hft_w20_p3_r2: 530
## Max. :147544 Max. :9889 hft_w21_p5_r1: 524
## (Other) :1037
## sample_time sample_cellLine sample_assay sample_batch percentRibo
## pcw16:1627 HFT3:1050 RNA NG:2833 b2019_06:1050 Min. :0.01381
## pcw20:1050 HFT5:1117 RNA v3:2167 b2020_02:2323 1st Qu.:0.12283
## pcw21:1117 HFT6:1206 b2020_03:1627 Median :0.15494
## pcw24:1206 HFT7:1627 Mean :0.15547
## 3rd Qu.:0.18616
## Max. :0.41943
##
## seurat_clusters Manual_type
## c0 : 993 Migrating glutamatergic neurons 1: 993
## c1 : 460 Interneurons 1 : 460
## c2 : 409 Migrating glutamatergic neurons 2: 409
## c3 : 378 Interneurons 2 : 378
## c5 : 332 Glutamatergic neurons 2 : 332
## c4 : 329 Glutamatergic neurons 1 : 329
## (Other):2099 (Other) :2099
barplot(summary(sobj@meta.data$sample_sampleId_short))
In summary we have 8 samples profiled at 4 different time points of the cortical development (week 16, 20, 21 and 24). There appear to be 2 replicates for each time point. The cells are also annotated for 3 different batches and two different assays (RNA v3, RNA NG).
table(sobj$sample_batch,sobj$sample_time)
##
## pcw16 pcw20 pcw21 pcw24
## b2019_06 0 1050 0 0
## b2020_02 0 0 1117 1206
## b2020_03 1627 0 0 0
table(sobj$sample_cellLine,sobj$sample_time) #these two annotation are equivalent
##
## pcw16 pcw20 pcw21 pcw24
## HFT3 0 1050 0 0
## HFT5 0 0 1117 0
## HFT6 0 0 0 1206
## HFT7 1627 0 0 0
DimPlot(sobj,group.by = c("sample_sampleId_short","sample_time",
"sample_assay","sample_batch"),cols = small_color_pal)
The cells cluster by batch on the UMAP visualization questioning the reliability of the given cell annotation. The batch annotation partially overlaps with the assay and sample time annotations. We have here a highly complex experimental design with potential batch effect that will be highly difficult to distinguish from meaningful biological variations during the cortical development.
Further information about how this data were generated is needed to decide whether or not a batch correction should be implemented and at which level(s) (sample, batch, assay).
table(sobj$sample_batch,sobj$sample_time)
##
## pcw16 pcw20 pcw21 pcw24
## b2019_06 0 1050 0 0
## b2020_02 0 0 1117 1206
## b2020_03 1627 0 0 0
table(sobj$sample_batch,sobj$sample_assay)
##
## RNA NG RNA v3
## b2019_06 0 1050
## b2020_02 1206 1117
## b2020_03 1627 0
As we only have the Seurat object, we don’t have the capture and sequencing reports that would allow us to have a complete view of the data quality. We can still perform basic quality control.
We create a new RNA assay with ENSG ids converted to HGNC gene names keeping only genes with at least one count.
rna_with_hgnc <- sobj[["RNA"]]$counts[rowSums(sobj[["RNA"]]$counts) > 0,]
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"),verbose = F)
ensg_to_hgnc <- getBM(filters= "ensembl_gene_id",
attributes= c("ensembl_gene_id","hgnc_symbol"),
values=rownames(rna_with_hgnc),
mart= mart,
verbose = F)
## Batch submitting query [=====>-------------------------] 20% eta: 2mBatch
## submitting query [===========>-------------------] 40% eta: 2mBatch submitting
## query [==================>------------] 60% eta: 1mBatch submitting query
## [========================>------] 80% eta: 23s
duplicated_ensg <- ensg_to_hgnc$ensembl_gene_id[duplicated(ensg_to_hgnc$ensembl_gene_id)]
ensg_to_hgnc[ensg_to_hgnc$ensembl_gene_id == duplicated_ensg,]
# when more than one gene name is found we keep only the first one
ensg_to_hgnc <- ensg_to_hgnc[!duplicated(ensg_to_hgnc),]
rna_with_hgnc <- rna_with_hgnc[ensg_to_hgnc$ensembl_gene_id,]
ensg_to_hgnc$gene_name <- ensg_to_hgnc$hgnc_symbol
# we keep ensg id when no gene name is available
ensg_to_hgnc$gene_name[ensg_to_hgnc$gene_name == ""] <- ensg_to_hgnc$ensembl_gene_id[ensg_to_hgnc$gene_name == ""]
rownames(rna_with_hgnc) <- make.unique(ensg_to_hgnc$gene_name)
sobj[["RNA_HGNC"]] <- CreateAssay5Object(counts = rna_with_hgnc)
DefaultAssay(sobj) <- "RNA_HGNC"
sobj <- NormalizeData(sobj)
## Normalizing layer: counts
Idents(sobj) <- "orig.ident"
# We have a percentRibo stat computed on the whole dataset (not subsetted)
# We can recompute this stat for the subssetted dataset
# We can also add the stat for the percentage of mitochondrial transcripts
sobj[["percent.mt"]] <- PercentageFeatureSet(sobj, pattern = "^MT-")
sobj[["percent.ribo"]] <- PercentageFeatureSet(sobj, pattern = "^RPS|^RPL")
VlnPlot(sobj, features = c("nFeature_RNA", "nCount_RNA","percent.mt","percent.ribo"), ncol = 2)
We have many cells with high percentages (> 15 %) of counts originating from mitochondrial (or ribosomal) transcripts. These cells are potentially of bad quality and should be filtered.
min(sobj$nFeature_RNA)
## [1] 501
max(sobj$nFeature_RNA)
## [1] 9889
min(sobj$nCount_RNA)
## [1] 642
max(sobj$nCount_RNA)
## [1] 147544
Furthermore, some cells with a high number of detected genes could be doublets. A cutoff of a minimum of 501 detected genes appears to have been applied to this dataset.
Further clarification is needed regarding how genes and cells were filtered, if at all. These preliminary results suggest that many low-quality cells may still be present. We can consider applying commonly used thresholds for the metrics mentioned above to retain only high-quality cells.
Additional advanced quality controls, such as automated doublet detection, could also be performed to further exclude dubious cells.
Using signatures provided by Seurat, we can assign each cell to a specific cell cycle phase (G2M/S or G1/G0) to assess potential confounding effects of the cell cycle on the processed data.
s_genes <- cc.genes$s.genes
g2m_genes <- cc.genes$g2m.genes
sobj <- CellCycleScoring(sobj, s.features = s_genes, g2m.features = g2m_genes, set.ident = TRUE)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
p1 <- DimPlot(sobj,group.by = "Phase",cols = small_color_pal[2:4])
p2 <- DimPlot(sobj,group.by = c("Manual_type"),cols = col_cell_type,label = T,repel = T)
wrap_plots(p1,p2,ncol = 1)
Radial glia and progenitor cells cluster by cell cycle phase in this processed dataset, suggesting that the cell cycle is a major driver of gene expression in these cell types, potentially obscuring finer biological differences.
This issue can be addressed using various approaches, with cell cycle regression being the most common.
Given the primary focus on interneurons, I propose starting with an initial analysis that relies on the provided annotations to identify markers for the different annotated cell types, particularly for the two interneuron clusters.
## RNA_HGNC assay has been normalized before
Idents(sobj) <- "Manual_type"
all_markers <- FindAllMarkers(sobj, only.pos = T,logfc.threshold = 0.25,min.pct = 0.1,verbose = F)
all_markers <- all_markers[all_markers$p_val_adj <0.05,]
write.csv(all_markers,file = "./tables/all_markers_original_anno.csv",quote = F,row.names = F)
all_markers[all_markers$cluster == "Interneurons 1",]
all_markers[all_markers$cluster == "Interneurons 2",]
We can plot known interneuron markers from the literature, identified through this analysis, such as DLX genes, CXCR4, or ERBB4.
common_interneuron_markers <- intersect(all_markers[all_markers$cluster == "Interneurons 2","gene"],
all_markers[all_markers$cluster == "Interneurons 1","gene"])
common_interneuron_markers
## [1] "ERBB4" "DLX6-AS1" "PDZRN4" "DLX2" "CXCR4" "GAD1"
## [7] "SP9" "PLS3" "DLX5" "ARX" "NRXN3" "GAD2"
## [13] "PDE4DIP" "C11orf96" "RIPOR2" "DLX1" "RBP1" "SOX1"
## [19] "DLX6" "ARL4D" "DCX" "CEMIP2" "NPAS1" "SLAIN1"
## [25] "ZNF536" "INA" "SLC32A1" "BTG1" "TMEM123" "MAFB"
## [31] "CADPS" "PFN2" "FGD3" "ARL4C" "MOB3B" "ST8SIA5"
## [37] "ZEB2" "TCF4" "CELF4" "SLC6A1" "LINC01305" "BCL11B"
## [43] "QKI" "SOX4" "NNAT" "VSTM2A" "MEG3" "NPAS3"
## [49] "PNOC" "NREP" "MTSS1" "DPYSL3" "GPD1" "KIF2A"
## [55] "EIF1" "SHTN1" "RARA" "MYO1B" "MYCBP2" "DSCAML1"
## [61] "TMSB10" "FTH1" "TLE5" "YBX1" "H3-3B" "JAKMIP2"
## [67] "FEZ1" "DCLK2" "CCDC88A" "NCAM1" "WLS" "CRMP1"
## [73] "PDZRN3" "CITED2" "CHL1" "STMN2" "ANKRD12" "HSPA1A"
## [79] "EIF4G2" "NTN4" "AKAP9" "ARRDC3" "CHD9" "MLLT11"
## [85] "PNRC1" "SLCO5A1" "PAIP2" "DDX6" "DCLK1" "DST"
## [91] "DAPK1" "SOX2" "RPS29" "EIF5" "CD24" "CADM1"
## [97] "ZNF91" "TXN" "ATP1B1" "ZBTB16" "KLF7" "NSG1"
## [103] "WNT7A" "GAS2L3" "RERE" "DNAJB6" "ZNF428" "TSC22D3"
## [109] "CRACD" "ZNF704" "CCSER1"
FeaturePlot(sobj, features = c("DLX1","DLX2","CXCR4","ERBB4"),reduction = "umap")
We can already observe expression differences for some genes between the two interneuron clusters. Let’s identify all differentially expressed genes between interneuron clusters 1 and 2.
DEG_interneurons_1_vs_2 <- FindMarkers(sobj, ident.1 = "Interneurons 1",ident.2 = "Interneurons 2",
logfc.threshold = 0.25,min.pct = 0.1)
DEG_interneurons_1_vs_2 <- DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$p_val_adj < 0.05,]
DEG_interneurons_1_vs_2$cluster <- "Up in Interneurons 1"
DEG_interneurons_1_vs_2$cluster[sign(DEG_interneurons_1_vs_2$avg_log2FC) < 0] <- "Up in Interneurons 2"
write.csv(DEG_interneurons_1_vs_2,
file = "./tables/deg_interneurons_1_vs_2_original_anno.csv",quote = F)
DEG_interneurons_1_vs_2
We can plot the top three significantly highly expressed genes for each interneuron type
top_interneurons_1 <- rownames(DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$cluster == "Up in Interneurons 1",])[1:3]
top_interneurons_2 <- rownames(DEG_interneurons_1_vs_2[DEG_interneurons_1_vs_2$cluster == "Up in Interneurons 2",])[1:3]
VlnPlot(sobj[,sobj$Manual_type %in% c("Interneurons 1","Interneurons 2")],features = c(top_interneurons_1,top_interneurons_2),cols = col_cell_type,stack = T,fill.by = "ident",flip = T)
The two interneuron clusters exhibit clear differences in gene expression. Further DEG analysis could be conducted to investigate changes in gene expression between these two clusters at different time points during cortical development. This will require careful consideration of the complex experimental design, for example, by using EdgeR with pseudobulks of the different samples.
RNA velocity analysis, combined with pseudotime analysis, could provide valuable insights into common or distinct progenitors for these two types of interneurons during cortical development. However, this will be a highly challenging analysis due to the experimental design and the batch effects observed in the processed data.
Overall, this processed data presents a complex design with confounding factors, including batch effects at different levels and cell cycle. No corrections appear to have been applied during the processing of this data to obtain the cell type annotations.
If the two clusters annotated as interneurons exhibit known markers, I would like to discuss whether a complete reanalysis of this data—ideally starting from the raw FASTQ files (which were not provided)—should be performed, with potential corrections for confounding factors. This would allow for more reliable and potentially finer annotations of the interneuron subtypes. Such a reanalysis could serve as the basis for the more advanced analyses discussed above.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux trixie/sid
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.28.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biomaRt_2.62.1 patchwork_1.3.0 ggplot2_3.5.1 Seurat_5.2.1
## [5] SeuratObject_5.0.2 sp_2.2-0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.9 magrittr_2.0.3
## [4] spatstat.utils_3.1-2 farver_2.1.2 rmarkdown_2.29
## [7] zlibbioc_1.52.0 vctrs_0.6.5 ROCR_1.0-11
## [10] memoise_2.0.1 spatstat.explore_3.3-4 progress_1.2.3
## [13] htmltools_0.5.8.1 curl_6.2.0 sass_0.4.9
## [16] sctransform_0.4.1 parallelly_1.42.0 KernSmooth_2.23-26
## [19] bslib_0.9.0 htmlwidgets_1.6.4 ica_1.0-3
## [22] httr2_1.1.0 plyr_1.8.9 plotly_4.10.4
## [25] zoo_1.8-12 cachem_1.1.0 igraph_2.1.4
## [28] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3
## [31] Matrix_1.7-2 R6_2.5.1 fastmap_1.2.0
## [34] GenomeInfoDbData_1.2.13 fitdistrplus_1.2-2 future_1.34.0
## [37] shiny_1.10.0 digest_0.6.37 colorspace_2.1-1
## [40] AnnotationDbi_1.68.0 S4Vectors_0.44.0 tensor_1.5
## [43] RSpectra_0.16-2 irlba_2.3.5.1 RSQLite_2.3.9
## [46] labeling_0.4.3 filelock_1.0.3 progressr_0.15.1
## [49] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
## [52] abind_1.4-8 compiler_4.4.2 bit64_4.6.0-1
## [55] withr_3.0.2 DBI_1.2.3 fastDummies_1.7.5
## [58] MASS_7.3-64 rappdirs_0.3.3 tools_4.4.2
## [61] lmtest_0.9-40 httpuv_1.6.15 future.apply_1.11.3
## [64] goftest_1.2-3 glue_1.8.0 nlme_3.1-167
## [67] promises_1.3.2 grid_4.4.2 Rtsne_0.17
## [70] cluster_2.1.8 reshape2_1.4.4 generics_0.1.3
## [73] gtable_0.3.6 spatstat.data_3.1-4 tidyr_1.3.1
## [76] hms_1.1.3 data.table_1.16.4 xml2_1.3.6
## [79] XVector_0.46.0 BiocGenerics_0.52.0 spatstat.geom_3.3-5
## [82] RcppAnnoy_0.0.22 ggrepel_0.9.6 RANN_2.6.2
## [85] pillar_1.10.1 stringr_1.5.1 spam_2.11-1
## [88] RcppHNSW_0.6.0 later_1.4.1 splines_4.4.2
## [91] dplyr_1.1.4 BiocFileCache_2.14.0 lattice_0.22-6
## [94] survival_3.8-3 bit_4.5.0.1 deldir_2.0-4
## [97] tidyselect_1.2.1 Biostrings_2.74.1 miniUI_0.1.1.1
## [100] pbapply_1.7-2 knitr_1.49 gridExtra_2.3
## [103] IRanges_2.40.1 scattermore_1.2 stats4_4.4.2
## [106] xfun_0.50 Biobase_2.66.0 matrixStats_1.5.0
## [109] UCSC.utils_1.2.0 stringi_1.8.4 lazyeval_0.2.2
## [112] yaml_2.3.10 evaluate_1.0.3 codetools_0.2-20
## [115] tibble_3.2.1 cli_3.6.3 uwot_0.2.2
## [118] xtable_1.8-4 reticulate_1.40.0 munsell_0.5.1
## [121] jquerylib_0.1.4 GenomeInfoDb_1.42.3 Rcpp_1.0.14
## [124] globals_0.16.3 spatstat.random_3.3-2 dbplyr_2.5.0
## [127] png_0.1-8 spatstat.univar_3.1-1 parallel_4.4.2
## [130] blob_1.2.4 presto_1.0.0 prettyunits_1.2.0
## [133] dotCall64_1.2 listenv_0.9.1 ggthemes_5.1.0
## [136] viridisLite_0.4.2 scales_1.3.0 ggridges_0.5.6
## [139] crayon_1.5.3 purrr_1.0.2 rlang_1.1.5
## [142] KEGGREST_1.46.0 cowplot_1.1.3